NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs

Carver, Benjamin; Zhang, Jingyuan; Wang, Haoliang; Mahadik, Kanak; Cheng, Yue (March 2026, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity.
more » « less
Free, publicly-accessible full text available March 22, 2027
DIFFSERVE: EFFICIENTLY SERVING TEXT-TO-IMAGE DIFFUSION MODELS WITH QUERY-AWARE MODEL SCALING

Ahmad, Sohaib; Yang, Qizheng; Wang, Haoliang; Sitaraman, Ramesh; Guan, Hui (May 2025, Proceedings of the 8 th MLSys Conference)

Free, publicly-accessible full text available May 18, 2026
DIFFSERVE: EFFICIENTLY SERVING TEXT-TO-IMAGE DIFFUSION MODELS WITH QUERY-AWARE MODEL SCALING

Ahmad, Sohaib; Yang, Qizheng; Wang, Haoliang; Sitaraman, Ramesh; Guan, Hui (May 2025, Proceedings of the 8 th MLSys Conference)

Free, publicly-accessible full text available May 17, 2026
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling

Ahmad, Sohaib; Yang, Qizheng; Wang, Haoliang; Sitaraman, Ramesh; Guan, Hui (May 2025, Proceedings of the 8th MLSys Conference, Santa Clara, CA, USA, 2025)

Free, publicly-accessible full text available May 12, 2026
Algorithmic Fairness Generalization under Covariate and Dependence Shifts Simultaneously”

Zhao, Chen; Jiang, Kai; Wu, Xintao; Wang, Haoliang; Khan, Latifur; Grant, Christan_Earl; Chen, Feng (August 2024, Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD))

Full Text Available
Alps: An Adaptive Learning, Priority OS Scheduler for Serverless Function

Fu, Yuqi; Shi, Ruizhe; Wang, Haoliang; Chen, Songqing; Cheng, Yue (July 2024, Proceedings of the 2024 USENIX Annual Technical Conference)

Full Text Available
ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions

Fu, Yuqi; Shi, Ruizhe; Wang, Haoliang; Chen, Songqing; Cheng, Yue (July 2024, 2024 USENIX Annual Technical Conference (ATC 2024))

FaaS (Function-as-a-Service) workloads feature unique patterns. Serverless functions are ephemeral, highly concurrent, and bursty, with an execution duration ranging from a few milliseconds to a few seconds. The workload behaviors pose new challenges to kernel scheduling. Linux CFS (Completely Fair Scheduler) is workload-oblivious and optimizes long-term fairness via proportional sharing. CFS neglects the short-term demands of CPU time from short-lived serverless functions, severely impacting the performance of short functions. Preemptive shortest job first—shortest remaining process time (SRPT)—prioritizes shorter functions in order to satisfy their short-term demands of CPU time and, therefore, serves as a best-case baseline for optimizing the turnaround time of short functions. A significant downside of approximating SRPT, however, is that longer functions might be starved. In this paper, we propose a novel application-aware kernel scheduler, ALPS (Adaptive Learning, Priority Scheduler), based on two key insights. First, approximating SRPT can largely benefit short functions but may inevitably penalize long functions. Second, CFS provides necessary infrastructure support to implement user-defined priority scheduling. To this end, we design ALPS to have a novel, decoupled scheduler frontend and backend architecture, which unifies approximate SRPT and proportional-share scheduling. ALPS’ frontend sits in the user space and approximates SRPT-inspired priority scheduling by adaptively learning from an SRPT simulation on a recent past workload. ALPS’ backend uses eBPF functions hooked to CFS to carry out the continuously learned policies sent from the frontend to inform scheduling decisions in the kernel. This design adds workload intelligence to workload-oblivious OS scheduling while retaining the desirable properties of OS schedulers. We evaluate ALPS extensively using two production FaaS workloads (Huawei and Azure), and results show that ALPS achieves a reduction of 57.2% in average function execution duration compared to CFS.
more » « less
Full Text Available
Probabilistic simulation supports generalizable intuitive physics

Wang, Haoliang; Jedoui, Khalid; Venkatesh, Rahul; Binder, Felix; Tenenbaum, Joshua B; Fan, Judith; Yamins, Daniel; Smith, Kevin A (July 2024, Cognitive Science Society)

How do people perform general-purpose physical reasoning across a variety of scenarios in everyday life? Across two stud ies with seven different physical scenarios, we asked participants to predict whether or where two objects will make contact. People achieved high accuracy and were highly consistent with each other in their predictions. We hypothesize that this robust generalization is a consequence of mental simulations of noisy physics. We designed an “intuitive physics engine” model to capture this generalizable simulation. We find that this model generalized in human-like ways to unseen stimuli and to a different query of predictions. We evaluated several state-of-the-art deep learning and scene feature models on the same task and found that they could not explain human predictions as well. This study provides evidence that human’s robust generalization in physics predictions are supported by a probabilistic simulation model, and suggests the need for structure in learned dynamics models.
more » « less
Full Text Available
GMorph: Accelerating Multi-DNN Inference via Model Fusion

Yang, Qizheng; Yang, Tianyi; Xiang, Mingcan; Zhang, Lijun; Wang, Haoliang; Serafini, Marco; Guan, Hui (April 2024, ACM)

AI-powered applications often involve multiple deep neural network (DNN)-based prediction tasks to support application level functionalities. However, executing multi-DNNs can be challenging due to the high resource demands and computation costs that increase linearly with the number of DNNs. Multi-task learning (MTL) addresses this problem by designing a multi-task model that shares parameters across tasks based on a single backbone DNN. This paper explores an alternative approach called model fusion: rather than training a single multi-task model from scratch as MTL does, model fusion fuses multiple task-specific DNNs that are pre-trained separately and can have heterogeneous architectures into a single multi-task model. We materialize model fusion in a software framework called GMorph to accelerate multi- DNN inference while maintaining task accuracy. GMorph features three main technical contributions: graph mutations to fuse multi-DNNs into resource-efficient multi-task models, search-space sampling algorithms, and predictive filtering to reduce the high search costs. Our experiments show that GMorph can outperform MTL baselines and reduce the inference latency of multi-DNNs by 1.1-3X while meeting the target task accuracy.
more » « less
Full Text Available
GMorph: Accelerating Multi-DNN Inference via Model Fusion

Yang, Qizheng; Yang, Tianyi; Xiang, Mingcan; Zhang, Lijun; Wang, Haoliang; Serafini, Marco; Guan, Hui (April 2024, ACM EuroSys'24)

AI-powered applications often involve multiple deep neural network (DNN)-based prediction tasks to support application level functionalities. However, executing multi-DNNs can be challenging due to the high resource demands and computation costs that increase linearly with the number of DNNs. Multi-task learning (MTL) addresses this problem by designing a multi-task model that shares parameters across tasks based on a single backbone DNN. This paper explores an alternative approach called model fusion: rather than training a single multi-task model from scratch as MTL does, model fusion fuses multiple task-specific DNNs that are pre-trained separately and can have heterogeneous architectures into a single multi-task model. We materialize model fusion in a software framework called GMorph to accelerate multi- DNN inference while maintaining task accuracy. GMorph features three main technical contributions: graph mutations to fuse multi-DNNs into resource-efficient multi-task models, search-space sampling algorithms, and predictive filtering to reduce the high search costs. Our experiments show that GMorph can outperform MTL baselines and reduce the inference latency of multi-DNNs by 1.1-3X while meeting the target task accuracy.
more » « less
Full Text Available

« Prev Next »

Search for: All records